History

• a couple years ago, it looked hopeless
• but then libv drew some triangles
• we have come a long way in the last year!


• etnaviv
  • gallium driver for vivante

• grate
  • gallium driver for tegra

• lima
  • classic/dri driver for mali 200/400

• freedreno
  • gallium driver for adreno 2xx/3xx
  • plus xf86-video-freedreno and msm drm/kms

Etnaviv: Vivante

• OpenGLES 2.0
• GC2000+: OpenGLES 3.0 and OpenCL

• Unified shader ISA
• Vertex texture fetch

• 2x/4x MSAA
• IMR (not tiler)

• Formats
  • Textures: 2D, cubemap
  • Texture compression: DXT1-5, ETC
  • Depth: 16b or 24b
  • Stencil: 8b
  • Index: 8b, 16b, or 32b

• Modular
  • 3D, 2D, compositing, and VG engines, each optional

• But mostly talking about 3D and 2D

Etnaviv: Devices

• SolidRun CuBox (GC800)
  • Marvell Armada 510 SoC
  • 800MHz dual-issue ARM PJ4
  • 1GiB DDR3
  • 1080p Video Decode Engine
  • HDMI, gigabit ethernet, eSATA, etc

• GK802 HDMI dongle (GC2000)

• GCW Zero (GC860)
  • Ingenic JZ4770 1GHz MIPS processor
  • 3.5" LCD (320x240)
  • 512MiB DDR2
  • mini-HDMI, A/V port, 802.11b/g/n

• Utilite (i.MX6)

Etnaviv: Hardware

[Block diagram: Host Interface, Memory Controller, 3-D Pipeline]

Etnaviv: Shader ISA

• Unified Shader
  • vec4 instructions
  • scalar integer instructions on GC2000
  • 128b instruction encoding

• Precision: FP32

; gl_Position = mvpMatrix * in_position
MUL t4, u0, t0.xxxx, void
MAD t4, u1, t0.yyyy, t4
MAD t4, u2, t0.zzzz, t4
MAD t4, u3, t0.wwww, t4
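
The MUL plus three MADs above build the matrix-vector product one column at a time. The following is a minimal Python sketch of that arithmetic, not of the hardware. It assumes (the slide does not say so) that u0..u3 hold the columns of mvpMatrix and that t0 holds in_position; the helper names `splat`, `vec_mul`, `vec_mad` and `transform_position` are hypothetical.

```python
def splat(v, i):
    """Replicate component i of a vec4, like the .xxxx/.yyyy swizzles."""
    return [v[i]] * 4


def vec_mul(a, b):
    """Component-wise vec4 multiply (MUL)."""
    return [x * y for x, y in zip(a, b)]


def vec_mad(a, b, c):
    """Component-wise vec4 multiply-add (MAD): a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def transform_position(u, t0):
    """u: four matrix columns (assumed layout), t0: input position."""
    t4 = vec_mul(u[0], splat(t0, 0))          # MUL t4, u0, t0.xxxx
    t4 = vec_mad(u[1], splat(t0, 1), t4)      # MAD t4, u1, t0.yyyy, t4
    t4 = vec_mad(u[2], splat(t0, 2), t4)      # MAD t4, u2, t0.zzzz, t4
    t4 = vec_mad(u[3], splat(t0, 3), t4)      # MAD t4, u3, t0.wwww, t4
    return t4


# Columns of a matrix that translates by (5, 6, 7).
columns = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [5, 6, 7, 1]]
position = transform_position(columns, [1, 2, 3, 1])
```

With these inputs, `position` is the input point shifted by the translation held in the fourth column.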

Etnaviv: Status

• Working gallium driver
  • But using fbdev backend only
  • Needs help for xorg DDX, DRM/DRI2 support, etc

• Very fast progress
  • r/e work started late 2012

Grate: Tegra

• OpenGLES 2.0

• Separate Vertex and Fragment shaders
  • very minimalist: no loops, etc
  • but good performance through massive # of cores

• 2x/4x MSAA (T40)
• IMR (not tiler)

• Formats
  • Textures: 2D, cubemap
  • Depth: 20b (T30) or 24b (T40)
  • Stencil: 8b
  • Index: 8b, 16b


Grate: Devices

• Tegra2
  • AC-100
  • Trimslice

• Tegra3
  • Nexus7 (original)

• Tegra4
  • Shield


Grate: Vertex Shader ISA

• VLIW vec4 (ALU) + scalar (SFU) co-dispatch
  • ALU: MOV, MUL, ADD, DP3, DP4, etc
  • SFU: SIN, COS, RCP, RSQ, LG2, EX2

• Precision: FP32

; gl_Position = mvpMatrix * in_position
mul r1.xyzw, v0.xxxx, c0.xyzw
mad r1.xyzw, v0.yyyy, c1.xyzw, r1.xyzw
mad r1.xyzw, v0.zzzz, c2.xyzw, r1.xyzw
mad o0.xyzw, v0.wwww, c3.xyzw, r1.xyzw

Grate: Fragment Shader ISA

• More weird, three different instruction streams
  • VAR/SFU - varying interpolate and special function unit
  • TEX - texture lookup
  • ALU - arithmetic logic unit

• ALU - packets of 3 or 4 scalar instructions
  • 3x 64b instructions + embedded constant
  • or 4x 64b instructions
  • only four opcodes:
    • MAD: rD = rA * rB + rC
    • MIN: rD = min(rA * rB, rC)
    • MAX: rD = max(rA * rB, rC)
    • CSEL: conditional select

• Precision: FP20

; gl_FragColor = texture2D(tex, vec2(0.0));
; gl_FragColor.r += gl_FragColor.a > 0.5 ? gl_FragColor.g : gl_FragColor.b;
ALU:002 mad r2.hl, #0, #1, #0
ALU:002 mad r3.hl, #0, #1, #0
TEX:002 tex s0
ALU:003 mad lt r0. , -r3.h, #1, ec0
ALU:004 csel r1.hl, -x0 half, r3.l, r2.h
ALU:004 mad r2.l, d0.hl, #1, r2.l
EXP:004 export alu
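
Because MAD, MIN and MAX all multiply their first two operands, the more familiar operations fall out of them by feeding in the constants 1 and 0. The Python sketch below models only the arithmetic given in the opcode list. It ignores FP20 rounding, and the CSEL signature is an assumption, since the slide says only "conditional select". The derived helpers (`mov`, `add`, `mul`, `clamp`) are hypothetical names, not grate mnemonics.

```python
def alu_mad(a, b, c):
    """MAD: rD = rA * rB + rC"""
    return a * b + c


def alu_min(a, b, c):
    """MIN: rD = min(rA * rB, rC)"""
    return min(a * b, c)


def alu_max(a, b, c):
    """MAX: rD = max(rA * rB, rC)"""
    return max(a * b, c)


def alu_csel(cond, a, b):
    """CSEL: conditional select (operand order is an assumption)."""
    return a if cond else b


# Operations derived from the four opcodes.
def mov(a):
    return alu_mad(a, 1.0, 0.0)


def add(a, b):
    return alu_mad(a, 1.0, b)


def mul(a, b):
    return alu_mad(a, b, 0.0)


def clamp(x, lo, hi):
    # min(x * 1, hi) followed by max(result * 1, lo)
    return alu_max(alu_min(x, 1.0, hi), 1.0, lo)


# gl_FragColor.r += gl_FragColor.a > 0.5 ? gl_FragColor.g : gl_FragColor.b
r, g, b, a = 0.25, 0.5, 0.125, 0.75
r = add(r, alu_csel(a > 0.5, g, b))
```

Here `a` is above the 0.5 threshold, so the green component is selected and added to `r`.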

Grate: Status

• Early research stage
  • command-stream capture and replay
  • basic GL state understood
  • vertex shader ISA understood
  • main work is on fragment shader ISA currently
  • basic gallium driver (clears)


Lima: Mali 200/400/t6xx 


• Mali 200/400
  • OpenGLES 2.0
  • Separate Vertex (GP) and Fragment (PP) shaders
  • Mali 400 available with 1-4x PP
  • Tile based IMR (16x16)
  • Textures: 2D, cubemap

• Mali t6xx
  • OpenGLES 2.0 / 3.0
  • OpenCL 1.1
  • Unified Shader ISA
  • Various # of shader cores and ALU widths
  • Tile based IMR
  • Textures: 2D, cubemap, 3D


Lima: Devices

• Mali-t6xx
  • Samsung Exynos5 (Chromebook, Nexus10)

• Mali-200/400
  • AMLogic 8726-M (Zenithink C71)
  • Allwinner A10 (Mele A1000, MK802)
  • Samsung Exynos4 (Galaxy S2/S3/Tab/Note)
  • Telechips 8902, 8803


Lima: Hardware 


[Block diagram: Mali-400 MP top level, showing the Geometry Processor, Pixel Processors #1-#4, Mali MMUs and Mali L2, with asynchronous APB, RESETs, IRQs and IDLEs signals]

Lima: 200/400 Vertex Shader

• Single-threaded but deeply pipelined
• Each fixed length VLIW instruction has fields for:
  • 2 addition ALU's
  • 2 multiplication ALU's
  • 1 complex ALU
  • 1 pass-through ALU
  • 1 attribute load
  • 1 register load
  • 1 uniform/temp load
  • 1 varying/register store
• No explicit output registers
  • Outputs from ALU of previous instruction routed directly to ALU in current instruction
  • 16 temporary registers to use when compiler scheduling cannot route directly (1 load/store per instr)
• Precision: FP32


; gl_Position = mvpMatrix * in_position
uniform.load(1), attribute.load(0), mul[0].mul(uniform.z, attribute.y),
mul[1].mul(uniform.y, attribute.y), store[0].register(0, mul[1].out, mul[0].out);

uniform.load(1), mul[0].mul(uniform.x, attrib.y[1]), mul[1].mul(uniform.w, attrib.y[1]),
store[1].register(0, mul[0].out, unused);

uniform.load(0), attribute.load(0), mul[0].mul(uniform.z, attribute.x), mul[1].mul(uniform.y, attribute.x);

uniform.load(0), attribute.load(0), mul[0].mul(uniform.x, attrib.x[1]), mul[1].mul(uniform.w, attrib.x[1]),
acc[0].pass(mul[0].out[1]), acc[1].pass(mul[1].out[2]), pass.pass(mul[1].out[1]),
store[1].register(0, unused, mul[0].out);

uniform.load(2), mul[0].mul(uniform.z, attrib.z[1]), mul[1].mul(uniform.w, attrib.z[1]),
acc[0].pass(attrib.z[1]), acc[1].add(mul[1].out[1], acc[1].out[1]), complex.pass(uniform.y),
pass.pass(uniform.x), store[0].register(4, unused, mul[0].out);

uniform.load(3), attribute.load(0), register[1].load(0), mul[0].mul(complex.out[1], acc[0].out[1]),
mul[1].mul(uniform.w, attribute.w), acc[0].add(acc[0].out[2], register[1].y),
acc[1].add(acc[1].out[1], mul[1].out[1]), complex.pass(pass.out[2]),
store[1].register(4, mul[0].out, unused);

uniform.load(3), register[0].load(0), register[1].load(0), mul[0].mul(pass.out[2], acc[0].out[2]),
mul[1].mul(uniform.z, attrib.w[1]), acc[0].add(complex.out[1], register[1].x),
acc[1].add(acc[1].out[1], mul[1].out[1]), complex.pass(attrib.w[1]),
pass.pass(register[1].w), store[0].register(3, mul[1].out, unused);

mul[0].pass(acc[0].out[1]), mul[1].pass(mul[0].out[1]), acc[0].pass(complex.out[1]),
acc[1].pass(mul[0].out[2]), complex.pass(acc[0].out[2]);

uniform.load(3), register[0].load(4), register[1].load(0), mul[0].complex2(acc[1].out[2], acc[1].out[2]),
mul[1].mul(uniform.y, acc[0].out[1]), acc[0].add(pass.out[2], register[1].z),
acc[1].add(complex.out[1], register[0].y), complex.rcp(acc[1].out[2]), pass.pass(acc[1].out[2]);

uniform.load(3), mul.complex1(complex.out[1], mul[0].out[1], complex.out[1], pass.out[1]),
acc[0].add(mul[0].out[2], acc[1].out[2]), acc[1].add(acc[0].out[1], mul[1].out[2]),
complex.pass(uniform.x), pass.pass(acc[0].out[2]);

uniform.load(4), register[0].load(0), register[1].load(3), mul[0].pass(uniform.y),
mul[1].mul(complex.out[1], pass.out[1]), acc[0].add(acc[1].out[2], register[1].x),
acc[1].add(acc[0].out[1], mul[1].out[2]), pass.pass(uniform.z);

uniform.load(6), mul[0].mul(acc[0].out[1], pass.out[1]), mul[1].mul(acc[1].out[1], mul[0].out[1]),
acc[1].add(acc[1].out[2], mul[1].out[1]), pass.clamp(mul[0].out[2]);

uniform.load(4), mul[0].mul(acc[1].out[1], uniform.x), mul[1].mul(mul[0].out[1], pass.out[1]);

uniform.load(5), mul[0].mul(mul[1].out[2], pass.out[2]), mul[1].mul(mul[0].out[1], pass.out[2]),
acc[0].pass(pass.out[2]), acc[1].add(mul[1].out[1], uniform.z);

uniform.load(5), mul[1].pass(acc[0].out[1]), acc[0].add(mul[0].out[1], uniform.y),
acc[1].add(mul[1].out[1], uniform.x), complex.pass(acc[1].out[1]),
store[0].varying(0, acc[1].out, acc[0].out), store[1].varying(0, complex.out, mul[1].out);


Lima: 200/400 Fragment Shader

• 128 thread barrel processor
• VLIW, variable length instr (32b aligned, up to 576b)
• 6 vec4 instr (encoding supports up to 12) plus 4 special:
  • const 0, const 1, tex sample result, uniform fetch result
  • but pipeline registers - direct connection between two units in pipeline
• Precision: FP16

; gl_FragColor = clamp(
;     vColor * texture2D(uTexture0, vTexCoord0),
;     0.0, 1.0);
$0 = varying[0];
varying[1].xy,
^texture = sampler2D(0),
$0 = ^vmul = clamp($0 * ^texture, 0.0, 1.0),
sync,
stop;


Lima: Status 


• Main focus so far: Mali 200/400
  • Mesa classic/dri driver starting to work
  • es2gears, textured cube, etc
  • Need to hook up cwabbott's compiler backend

• Some preliminary investigations on Mali t6xx compiler

Freedreno: Adreno 2xx/3xx

• Adreno 2xx
  • OpenGLES 2.0
  • Unified Shader ISA
  • VLIW vec4 + scalar co-dispatch

• Adreno 3xx
  • OpenGLES 2.0 / 3.0
  • OpenCL 1.1 (embedded profile, no double)
  • Unified Shader ISA
  • explicitly pipelined scalar

• Common
  • Textures: 2D, cubemap, 3D
  • 2x/4x MSAA
  • Tile Based IMR: 256KiB-1MiB GMEM/OCMEM
    • driver explicitly handles tiling (incl. restore/resolve)
    • hw binning pass to avoid duplicated vertex processing
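
To give a feel for what the GMEM sizes above mean for the driver, here is a rough Python sketch of a lower bound on the number of tiles needed to cover a render target. It is only arithmetic: the function name and the 8 bytes-per-pixel figure (32b color plus 32b depth/stencil) are assumptions for illustration, and a real driver would also need to respect hardware alignment rules that the slides do not describe.

```python
def min_tile_count(width, height, bytes_per_pixel, gmem_bytes):
    """Lower bound on tiles: total framebuffer bytes / GMEM size, rounded up."""
    fb_bytes = width * height * bytes_per_pixel
    return -(-fb_bytes // gmem_bytes)  # integer ceiling division


KIB = 1024
# 1080p target, assumed 4 bytes color + 4 bytes depth/stencil per pixel.
tiles_small_gmem = min_tile_count(1920, 1080, 8, 256 * KIB)
tiles_large_gmem = min_tile_count(1920, 1080, 8, 1024 * KIB)
```

Under these assumptions the 256KiB case needs at least 64 tiles and the 1MiB case at least 16. Each tile replays the same draw commands, so this multiplier is the cost the hw binning pass aims to reduce.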

Freedreno: Devices

• Adreno 2xx
  • Snapdragon S3 (HP TouchPad)
  • Freescale iMX5 (Quickstart, Efika-MX)

• Adreno 3xx
  • Snapdragon S4 Pro (Nexus4, ifc6410)
  • Snapdragon-600 (Nexus7, Samsung Galaxy S4)
  • Snapdragon-800 (Nexus5, LG G2, Sony Xperia Z Ultra)

Freedreno: Tiling/GMEM

• "large", flexible tile buffer

• driver explicitly handles tiling
  • draw/clear cmds built up normally (like IMR)
  • flush: build restore/IB/resolve cmds

IB - indirect branch



GPU begins executing from here 
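
The flush step can be sketched as a loop over tiles that wraps the already-recorded draw commands. This is an illustrative Python model only: the tuple-based command encoding, the `build_flush_cmds` name and its parameters are hypothetical and do not reflect the actual freedreno command stream.

```python
def build_flush_cmds(fb_width, fb_height, tile_width, tile_height,
                     draw_ib, restore=True):
    """Return the per-tile command list the GPU would start executing from.

    draw_ib stands for the buffer of draw/clear commands that was built up
    normally, as on an immediate-mode renderer.
    """
    cmds = []
    for y in range(0, fb_height, tile_height):
        for x in range(0, fb_width, tile_width):
            # Clip the last row/column of tiles to the framebuffer edge.
            tile = (x, y,
                    min(tile_width, fb_width - x),
                    min(tile_height, fb_height - y))
            if restore:
                # Copy existing framebuffer contents from memory into GMEM.
                cmds.append(("restore", tile))
            # Indirect branch to the shared draw commands.
            cmds.append(("ib", draw_ib))
            # Write the finished tile from GMEM back to memory.
            cmds.append(("resolve", tile))
    return cmds


cmds = build_flush_cmds(64, 48, 32, 32, draw_ib="draw_cmds")
```

The draw commands are recorded once and referenced from every tile, which is why the driver can build them "like IMR" and defer all tiling decisions to flush time.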

Freedreno: Adreno 2xx Shader ISA

• 96 bit encoding FETCH/ALU
• 48 bit encoding CF
• like r600, all CF instructions first
• Precision: FP32

EXEC ADDR(0x3) CNT(0x4)
FETCH: VERTEX R1.xyzw = R0.x FMT_1_REVERSE UNSIGNED NORMALIZED STRIDE(0) CONST(1, 1)
(S)ALU: MULv R0 = R1.wwww, C3
ALU: MULADDv R0 = R0, R1.zzzz, C2
ALU: MULADDv R0 = R0, R1.yyyy, C1
ALLOC POSITION SIZE(0x0)
EXEC ADDR(0x7) CNT(0x1)
ALU: MULADDv export62 = R0, R1.xxxx, C0 ; gl_Position
ALLOC PARAM/PIXEL SIZE(0x0)
EXEC_END ADDR(0x8) CNT(0x0)
NOP

Freedreno: Adreno 3xx Shader ISA

• 64 bit encoding, 7 instruction categories
  • cat0: flow control, kill
  • cat1: mov/convert
  • cat2: ALU (add, mul..)
  • cat3: 3src ALU (mad, sel)
  • cat4: complex (rcp, rsq..)
  • cat5: tex sample
  • cat6: load/store/atomic
• "repeat" field to increase instr density
• Compiler responsible for pipelining
  • cat1-3: results avail 3 instr slots later (-1 for mad 3rd src)
  • cat4-5: special ss/sy sync bits for dependent instr
• Precision: FP32/FP16/U32/U16/S32/S16

; gl_FragColor.x = dot(v1, v2);
mul.f hr0.x, hr0.x, hr1.x
nop
mad.f16 hr0.x, hr0.y, hr1.y, hr0.x
nop
mad.f16 hr0.x, hr0.z, hr1.z, hr0.x
nop
mad.f16 hr0.x, hr0.w, hr1.w, hr0.x
end

; gl_FragColor = v1;
(rpt3)mov.f16f16 hr0.x, (r)hr1.x
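
The nop placement in the dot-product listing follows from the cat1-3 latency rule. The Python sketch below is one reading of that rule, not the freedreno compiler: a consumer must sit at least 3 slots after its producer, or 2 slots when the value is the third source of a mad. The `Instr` type and `insert_nops` function are hypothetical, and the cat4-5 sync bits are not modelled.

```python
from dataclasses import dataclass

RESULT_DELAY = 3  # cat1-3 results are available 3 instruction slots later


@dataclass
class Instr:
    op: str        # mnemonic, e.g. "mad.f16"
    cat: int       # instruction category
    dst: str       # destination register
    srcs: tuple    # source registers in operand order


def insert_nops(instrs):
    """Return the list of mnemonics with nops added to satisfy latencies."""
    out = []
    written_at = {}  # register -> slot of the cat1-3 instr that last wrote it
    for ins in instrs:
        earliest = len(out)
        for idx, src in enumerate(ins.srcs):
            if src not in written_at:
                continue
            delay = RESULT_DELAY
            if ins.op.startswith("mad") and idx == 2:
                delay -= 1  # "-1 for mad 3rd src"
            earliest = max(earliest, written_at[src] + delay)
        out.extend(["nop"] * (earliest - len(out)))
        if ins.cat in (1, 2, 3):
            written_at[ins.dst] = len(out)
        else:
            written_at.pop(ins.dst, None)  # other categories not modelled
        out.append(ins.op)
    return out


# gl_FragColor.x = dot(v1, v2)
dot = [
    Instr("mul.f", 2, "hr0.x", ("hr0.x", "hr1.x")),
    Instr("mad.f16", 3, "hr0.x", ("hr0.y", "hr1.y", "hr0.x")),
    Instr("mad.f16", 3, "hr0.x", ("hr0.z", "hr1.z", "hr0.x")),
    Instr("mad.f16", 3, "hr0.x", ("hr0.w", "hr1.w", "hr0.x")),
]
schedule = insert_nops(dot)
```

Each mad consumes the running sum as its third source, so a single nop between instructions is enough, matching the listing. Had the sum been used as a first or second source, two nops would be required.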

Freedreno: Status

• Initial msm drm/kms driver in v3.12

• Working gallium driver
  • mesa 9.2 and master (but use master)
  • supports a2xx and a3xx
  • supports either msm drm/kms or android kgsl/fbdev

• Xorg - xf86-video-freedreno
  • uses z180 2d core on devices which have it
  • work-in-progress (but issues) XA state tracker
  • Supports either msm drm/kms or android kgsl/fbdev

• Wayland/Weston support
  • msm drm/kms only

• Supported
  • OpenGL 1.4 - on best-effort basis
  • OpenGL ES 1.0/2.0
  • Textures: 2D, cubemap, 3D (incl mipmap)

• TODO
  • MSAA
  • hw binning pass (game performance!)
  • compiler could be a lot better

• Known to be working
  • gnome-shell, xbmc, xonotic, openarena, etc

• Working but minor issues
  • etuxracer, supertuxkart (MSAA issue)


Resources

• etnaviv
  • Main Developer: Wladimir J. Van Der Laan (wumpus)
  • IRC: #etnaviv (freenode)
  • https://blog.visucore.com/
  • https://github.com/laanwj/etna_viv

• grate
  • Main Developer: Erik Faye-Lund (kusma)
  • IRC: #lima (on freenode)
  • https://github.com/grate-driver/

• lima
  • Developers: Luc Verhaegen (libv) and Connor Abbott (cwabbott)
  • IRC: #lima (on freenode)
  • http://limadriver.org/

• freedreno
  • Main Developer: Rob Clark (robclark)
  • IRC: #freedreno (freenode)
  • http://bloggingthemonkey.blogspot.com/
  • http://freedreno.github.io/

